Modeling long temporal contexts for robust DNN-based speech recognition
Authors
Abstract
Deep Neural Networks (DNNs) have been shown to outperform traditional Gaussian Mixture Models in many Automatic Speech Recognition tasks. In this work, we investigate the potential of modeling long temporal acoustic contexts using DNNs. The complete temporal context is split into several sub-contexts. Multiple sub-context DNNs, initialized with the same set of Restricted Boltzmann Machines, are fine-tuned independently, and their last hidden layer activations are combined to jointly predict the desired state posteriors through a single softmax output layer. In preliminary experiments on the Aurora2 multi-style training task, our proposed system models a 65-frame temporal window of speech signals and yields a 4.4% WER, a 12.0% relative improvement over the best single DNN. Under the local independence assumption, both training and testing of the sub-context DNNs can be done in parallel. Moreover, our system has 48.2% fewer parameters, in relative terms, than a single DNN with the same number of hidden units.
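The combination scheme described in the abstract can be illustrated with a minimal forward-pass sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the number of sub-contexts (5), the feature dimension, the layer sizes, the sigmoid hidden units, the random weight initialization (standing in for the RBM pre-training and independent fine-tuning), and all function names. Only the overall shape follows the abstract: a 65-frame window is split into sub-contexts, each sub-context passes through its own network, and the last hidden layer activations are concatenated and fed to a single softmax layer.

```python
import math
import random


def split_context(frames, num_sub):
    """Split a window of frames into num_sub contiguous, equal-length sub-contexts."""
    if len(frames) % num_sub != 0:
        raise ValueError("window length must be divisible by num_sub")
    size = len(frames) // num_sub
    return [frames[i * size:(i + 1) * size] for i in range(num_sub)]


def init_layer(rng, n_in, n_out):
    """Random weights and zero biases (a stand-in for RBM pre-training)."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def affine(layer, x):
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def sigmoid_layer(layer, x):
    return [1.0 / (1.0 + math.exp(-z)) for z in affine(layer, x)]


def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def forward(frames, sub_nets, output_layer):
    """Run each sub-context through its own network, concatenate the last
    hidden activations, and apply one shared softmax output layer."""
    combined = []
    for sub_frames, layers in zip(split_context(frames, len(sub_nets)), sub_nets):
        h = [v for frame in sub_frames for v in frame]  # flatten the sub-context
        for layer in layers:
            h = sigmoid_layer(layer, h)
        combined.extend(h)
    return softmax(affine(output_layer, combined))


# Hypothetical sizes, chosen only to keep the example small.
WINDOW, FEAT_DIM, NUM_SUB, HIDDEN, NUM_STATES = 65, 3, 5, 8, 4

rng = random.Random(0)
sub_input = (WINDOW // NUM_SUB) * FEAT_DIM
sub_nets = [
    [init_layer(rng, sub_input, HIDDEN), init_layer(rng, HIDDEN, HIDDEN)]
    for _ in range(NUM_SUB)
]
output_layer = init_layer(rng, NUM_SUB * HIDDEN, NUM_STATES)

frames = [[rng.uniform(-1.0, 1.0) for _ in range(FEAT_DIM)] for _ in range(WINDOW)]
posteriors = forward(frames, sub_nets, output_layer)
```

Because each sub-network in this sketch reads only its own slice of the window, the loop over sub-contexts has no cross-dependencies, which is what would allow the sub-context networks to be evaluated in parallel.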
Similar resources
Off-line Arabic Handwritten Recognition Using a Novel Hybrid HMM-DNN Model
To facilitate the entry of data into the computer and its digitization, automatic recognition of printed texts and manuscripts is a considerable aid to many applications. Research on automatic document recognition started decades ago with the recognition of isolated digits and letters, and today, due to advancements in machine learning methods, efforts are being made to iden...
Allophone-based acoustic modeling for Persian phoneme recognition
Phoneme recognition is one of the fundamental phases of automatic speech recognition. Coarticulation, which refers to the integration of sounds, is one of the important obstacles in phoneme recognition. In other words, each phone is influenced and changed by the characteristics of its neighboring phones, and coarticulation is responsible for most of these changes. The idea of modeling the effects o...
Improving Phoneme Sequence Recognition using Phoneme Duration Information in DNN-HSMM
Improving phoneme recognition has attracted the attention of many researchers due to its applications in various fields of speech processing. Recent research shows that using deep neural networks (DNNs) in speech recognition systems significantly improves their performance. DNN-based phoneme recognition systems have two phases: training and testing. Mos...
F0 Contour Analysis Based on Empirical Mode Decomposition for DNN Acoustic Modeling in Mandarin Speech Recognition
Tone information provides a strong distinction for many ambiguous characters in Mandarin Chinese. The use of tonal acoustic units and F0 related tonal features have been shown to be effective at improving the accuracy of Mandarin automatic speech recognition (ASR) systems, as F0 contains the most prominent tonal information for distinguishing words that are phonemically identical. Both long-ter...
Long Short-Term Memory for Speaker Generalization in Supervised Speech Separation
Speech separation can be formulated as learning to estimate a time-frequency mask from acoustic features extracted from noisy speech. For supervised speech separation, generalization to unseen noises and unseen speakers is a critical issue. Although deep neural networks (DNNs) have been successful in noise-independent speech separation, DNNs are limited in modeling a large number of speakers. T...